
British Journal of Ophthalmology

BMJ

All preprints, ranked by how well they match British Journal of Ophthalmology's content profile, based on 14 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Utilizing AI-Generated Plain Language Summaries to Enhance Interdisciplinary Understanding of Ophthalmology Notes: A Randomized Trial

Tailor, P. D.; D'Souza, H. S.; Castillejo Becerra, C.; Dahl, H. M.; Patel, N. R.; Kaplan, T. M.; Kohli, D.; Bothun, E. D.; Mohney, B. G.; Tooley, A. A.; Baratz, K. H.; Iezzi, R.; Barkmeier, A. J.; Bakri, S. J.; Roddy, G. W.; Hodge, D.; Sit, A. J.; Starr, M. R.; Chen, J. J.

2024-09-13 ophthalmology 10.1101/2024.09.12.24313551 medRxiv
Top 0.1%
19.1%

Background: Specialized terminology employed by ophthalmologists creates a comprehension barrier for non-ophthalmology providers, compromising interdisciplinary communication and patient care. Current solutions such as manual note simplification are impractical or inadequate. Large language models (LLMs) present a potential low-burden approach to translating ophthalmology documentation into accessible language. Methods: This prospective, randomized trial evaluated the addition of LLM-generated plain language summaries (PLSs) to standard ophthalmology notes (SONs). Participants included non-ophthalmology providers and ophthalmologists. The study assessed: (1) non-ophthalmology providers' comprehension and satisfaction with either the SON (control) or SON+PLS (intervention), (2) ophthalmologists' evaluation of PLS accuracy, safety, and time burden, and (3) objective semantic and linguistic quality of PLSs. Results: 85% of non-ophthalmology providers (n=362, 33% response rate) preferred the PLS to the SON. Non-ophthalmology providers reported enhanced diagnostic understanding (p=0.012), increased note detail satisfaction (p<0.001), and improved explanation clarity (p<0.001) for notes containing a PLS. The addition of a PLS narrowed comprehension gaps between providers who were comfortable and uncomfortable with ophthalmology terminology at baseline (intergroup difference p<0.001 to p>0.05). PLS semantic analysis demonstrated high meaning preservation (BERTScore mean F1 score: 0.85) with greater readability (Flesch Reading Ease: 51.8 vs. 43.6; Flesch-Kincaid Grade Level: 10.7 vs. 11.9). Ophthalmologists (n=489, 84% response rate) reported high PLS accuracy (90% "a great deal") with minimal review time burden (94.9% ≤1 minute). The PLS error rate was 26% on initial ophthalmologist review and editing, and 15% on independent ophthalmologist over-read of edited PLSs. 84.9% of identified errors were deemed low risk for patient harm, and 0% carried a risk of severe harm or death.
Conclusions: LLM-generated plain language summaries enhance the accessibility and utility of ophthalmology notes for non-ophthalmology providers while maintaining high semantic fidelity and improving readability. PLS error rates underscore the need for careful implementation and ongoing safety monitoring in clinical practice.
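The readability scores cited above (Flesch Reading Ease and Flesch-Kincaid Grade Level) are standard closed-form formulas over sentence, word, and syllable counts. A minimal sketch, assuming a naive vowel-group syllable counter (the preprint does not state which implementation it used):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels as syllables.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fre, fkgl
```

Lower Reading Ease and higher Grade Level mean harder text, which is why the PLS figures (51.8, grade 10.7) indicate easier reading than the source notes (43.6, grade 11.9).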

2
Submission policy similarity and resubmission burden across the top 50 ophthalmology journals

Kaleem, S.; Tuitt-Barnes, D.; Maxwell, O.; Micieli, J. A.

2026-03-24 ophthalmology 10.64898/2026.03.20.26348949 medRxiv
Top 0.1%
19.1%

After rejection, resubmission of scientific manuscripts often requires substantial journal-specific reformatting. We compared systematic review submission policies across high-impact ophthalmology journals and quantified policy similarity to support resubmission planning. We identified the top 50 ophthalmology journals by SCImago Journal Rank that publish systematic reviews and are not invite-only, extracted policies from author instructions using an a priori data dictionary, and computed pairwise similarity on a 0 to 1 scale using the Gower coefficient across mixed policy variables with available-case denominators for unstated fields. Policies were heterogeneous and frequently unstated. Only 29 of 50 journals (58%) stated a main-text word limit; among journals with numeric limits, the median was 4000 words (interquartile range 3500 to 5500; n = 23). Preferred Reporting Items for Systematic Reviews and Meta-Analyses compliance was explicitly required by 35 of 50 journals (70%), and prospective registration by 6 of 50 journals (12%). Across 1225 journal pairs, similarity was modest, with a median of 0.64 (interquartile range 0.57 to 0.71; range 0.05 to 0.98). Similarity among the top 5 highest-ranking journals ranged from 0.62 to 0.90 (median 0.75). Systematic review submission policies vary widely across high-impact ophthalmology journals, and most journal pairs show only modest similarity. Similarity-based guidance may help identify policy-aligned resubmission targets while anticipating common sources of reformatting burden.
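The pairwise similarity described above uses the Gower coefficient over mixed variables with available-case denominators for unstated fields: numeric variables contribute a range-normalized difference, categorical variables an exact match, and missing fields drop out of the denominator. A minimal sketch, with hypothetical field names standing in for the study's data dictionary:

```python
def gower_similarity(a: dict, b: dict, ranges: dict) -> float:
    """Gower similarity on a 0-1 scale over mixed variables; fields
    missing (None) in either record are excluded from the denominator
    (available-case), mirroring unstated policy fields."""
    score, n = 0.0, 0
    for key in a.keys() & b.keys():
        x, y = a[key], b[key]
        if x is None or y is None:
            continue  # unstated policy field: skip
        if key in ranges:  # numeric: 1 - normalized absolute difference
            r = ranges[key]
            score += 1.0 - (abs(x - y) / r if r else 0.0)
        else:              # categorical: exact match
            score += 1.0 if x == y else 0.0
        n += 1
    return score / n if n else float("nan")
```

For example, two journals agreeing on PRISMA compliance but differing by 1000 words on a 4000-word observed range, with registration policy unstated at one journal, score (0.75 + 1.0) / 2 = 0.875.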

3
Vision-Language Models vs Autonomous AI Agents for Anterior Capsular Radial Folds: A Diagnostic Study

Zhang, Y.; Chen, L.; Zhao, W.; Zhang, H.; Qiao, C.; Liu, Z.; Chung, C. H.; Tan, M. C. J.; Wang, M.; Tham, Y. C.; Koh, V.; Cheng, C.; Liu, D.

2026-01-16 ophthalmology 10.64898/2026.01.15.26344200 medRxiv
Top 0.1%
18.8%

Importance: Early intraoperative warning signs of zonular instability during cataract surgery, such as anterior capsular radial folds, are subtle and easily missed but are clinically important for preventing surgical complications. Whether current artificial intelligence (AI) systems can reliably detect such subtle warning signs in real-world surgical video remains unknown. Recently, automated AI model generators have become available, enabling the automatic construction of task-specific AI models for individual clinical tasks. Objective: To evaluate the diagnostic performance of general-purpose and automated task-specific artificial intelligence systems for detecting anterior capsular radial folds during cataract surgery and to compare their performance with human clinicians. Design, Setting, and Participants: This retrospective diagnostic study used 537 continuous curvilinear capsulorhexis (CCC) video clips collected from Beijing Tongren Hospital (China), National University Hospital (Singapore), and the OphNet-APTOS public dataset. Exposure: Presence or absence of anterior capsular radial folds during CCC, annotated at both clip and frame levels by senior glaucoma surgeons based on expert consensus. Main Outcomes and Measures: Discrimination between fold-positive and fold-negative cases was assessed using macro-averaged precision, recall, and F1 score at the frame and clip levels. Performance was compared among general-purpose AI systems, task-specific models generated by an automated AI model generator, and human graders with different levels of clinical experience. Results: Among 537 video clips (mean 7.32 seconds), 156 (29.1%) were fold-positive. General-purpose AI systems showed limited and inconsistent performance; the best-performing model achieved a mean F1 score of 0.519, and fine-tuned models remained inferior to human graders (maximum F1 score, 0.606).
In contrast, task-specific models generated by an automated AI model generator achieved substantially higher performance (F1 score, 0.869; area under the receiver operating characteristic curve, 0.958). In head-to-head comparison with clinicians, the top automated task-specific model (F1 score, 0.835) matched the performance of junior specialists (mean F1 score, 0.829) but remained below that of senior specialists. Conclusions and Relevance: General-purpose artificial intelligence systems do not reliably detect subtle intraoperative warning signs during cataract surgery and consistently underperform human clinicians. In contrast, recently available automated AI model generators enable the creation of task-specific models with near-junior-specialist performance. These findings suggest that clinically reliable surgical AI is more likely to be achieved through automated generation of task-specific models than through general-purpose AI systems. Although evaluated in cataract surgery, these findings highlight a broader challenge for artificial intelligence in detecting brief, low-contrast intraoperative warning signs in surgical video. Key Points: Question: How reliably can general-purpose artificial intelligence (AI) systems and task-specific AI models generated by an automated AI model generator detect subtle intraoperative warning signs during cataract surgery compared with human clinicians? Findings: In this multicenter diagnostic study of 537 cataract surgery video clips, general-purpose AI systems were unreliable and consistently underperformed human clinicians in detecting anterior capsular radial folds. In contrast, task-specific AI models generated by an automated AI model generator (a technology that has only recently become available) achieved substantially higher diagnostic performance and matched the performance of junior specialists.
Meaning: General-purpose AI systems show limited reliability for detecting subtle intraoperative warning signs during cataract surgery. The recent availability of automated AI model generators enables a new paradigm of task-specific model development and represents a more clinically viable path for surgical decision support.
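The macro-averaged F1 reported here averages per-class F1 without weighting by class frequency, so the rarer fold-positive class (29.1% of clips) counts as much as the fold-negative class. A minimal sketch (the counts in the test are illustrative, not the study's):

```python
def macro_f1(tp: list[int], fp: list[int], fn: list[int]) -> float:
    """Macro-averaged F1: per-class F1, then an unweighted mean.
    Lists are indexed by class (e.g., fold-positive, fold-negative)."""
    f1s = []
    for t, p, n in zip(tp, fp, fn):
        precision = t / (t + p) if (t + p) else 0.0
        recall = t / (t + n) if (t + n) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)
```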

4
GlaucoRAG: A Retrieval-Augmented Large Language Model for Expert-Level Glaucoma Assessment

Aminan, M. I.; Darnell, S. S.; Delsoz, M.; Nabavi, S. A.; Wright, C.; Kanner, E.; Jerkins, B.; Yousefi, S.

2025-07-07 ophthalmology 10.1101/2025.07.03.25330805 medRxiv
Top 0.1%
18.6%

Purpose: Accurate glaucoma assessment is challenging because of the complexity and chronic nature of the disease; therefore, there is a critical need for models that provide evidence-based, accurate assessment. The purpose of this study was to evaluate the capabilities of a glaucoma-specialized Retrieval-Augmented Generation (RAG) framework (GlaucoRAG) that leverages a large language model (LLM) for diagnosing glaucoma and answering glaucoma-specific questions. Design: Evaluation of diagnostic capabilities and knowledge of emerging technologies in glaucoma assessment. Participants: Detailed case reports from 11 patients and 250 multiple-choice questions from the Basic and Clinical Science Course (BCSC) Self-Assessment were used to test the LLM-based GlaucoRAG. No human participants were involved. Methods: We developed GlaucoRAG, a RAG framework leveraging GPT-4.5-PREVIEW integrated with the R2R platform for automated question answering in glaucoma. We created a glaucoma knowledge base comprising more than 1,800 peer-reviewed glaucoma articles, 15 guidelines, and three glaucoma textbooks. Diagnostic performance was tested on case reports and multiple-choice questions. Model outputs were compared with the independent answers of three glaucoma specialists, DeepSeek-R1, and GPT-4.5-PREVIEW (without RAG). Quantitative performance was further assessed with the RAG Assessment (RAGAS) framework, reporting faithfulness, context precision, context recall, and answer relevancy. Main Outcome Measures: The primary outcome measure was GlaucoRAG's diagnostic accuracy on patient case reports and percentage of correct responses to the BCSC Self-Assessment glaucoma items, compared with the performance of glaucoma specialists and two benchmark LLMs. Secondary outcomes included RAGAS subscores. Results: GlaucoRAG achieved an accuracy of 81.8% on glaucoma case reports, compared with 72.7% for GPT-4.5-PREVIEW and 63.7% for DeepSeek-R1.
On glaucoma BCSC Self-Assessment questions, GlaucoRAG achieved 91.2% accuracy (228/250), whereas GPT-4.5-PREVIEW and DeepSeek-R1 attained 84.4% (211/250) and 76.0% (190/250), respectively. The RAGAS evaluation returned an answer relevancy of 91%, with 80% context recall, 70% faithfulness, and 59% context precision. Conclusions: The glaucoma-specialized LLM, GlaucoRAG, showed encouraging performance in glaucoma assessment and may complement glaucoma research and clinical practice, as well as question answering for glaucoma patients.

5
Identification of Risk Factors for Glaucoma Progression in Free-Text Clinical Notes using a Local Small Language Model

Bhatnagar, A.; Scherer, R.; Samico, G. A.; Muralidhar, R.; Gutkind, N. E.; Palazoni, V.; Medeiros, F. A.; Swaminathan, S. S.

2025-09-29 ophthalmology 10.1101/2025.09.26.25336746 medRxiv
Top 0.1%
18.4%

Purpose: To evaluate the performance of a large language model (LLM) in identifying medication non-adherence, visit non-adherence, and family history of glaucoma (FHoG) in clinical notes from the electronic health record (EHR). Methods: We extracted clinical notes of 1,250 glaucoma-related encounters between 2014 and 2024 and structured EHR family history field data from the Bascom Palmer Ophthalmic Repository, with 125 randomly selected notes (10%) used for prompt development and excluded from analysis. Two fellowship-trained glaucoma specialists labeled notes for evidence of non-adherence and FHoG. We utilized MedGemma-27B-text-it, a specialized medical LLM, to identify medication non-adherence, visit non-adherence, and FHoG. We calculated accuracy, sensitivity, and specificity of LLM performance for each task, the Jaccard index for FHoG, and the mean squared error (MSE) of the number of family members with glaucoma. Results: Prevalence of medication non-adherence, visit non-adherence, and FHoG was 7.3%, 4.7%, and 29.2%, respectively. LLM accuracy was 0.91 (sensitivity: 0.96; specificity: 0.91) for medication non-adherence and 0.96 (sensitivity: 0.97; specificity: 0.94) for visit non-adherence. For FHoG, LLM accuracy was 0.98 (sensitivity: 0.99; specificity: 0.99) with a Jaccard index of 0.99, while EHR family history field accuracy and Jaccard index were 0.49 and 0.75, respectively. LLM and EHR MSE in quantifying the number of relatives with glaucoma were 0.05±0.56 and 0.85±1.80, respectively (p<0.001). Conclusions: LLMs identified non-adherence to medication and visit schedules, as well as degree of FHoG, in clinical notes with high accuracy. Translational Relevance: Local LLM pipelines can enable large-scale research into glaucoma risk factors that are unavailable in discrete EHR fields.
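The Jaccard index and MSE used to score FHoG extraction are simple to state; a minimal sketch, with illustrative inputs (the study compares extracted sets of affected relatives and their counts against specialist labels):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|; defined as 1.0 when both empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mse(pred: list[float], truth: list[float]) -> float:
    """Mean squared error between predicted and labeled counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)
```

For example, extracting {mother, brother} when the label is {mother} gives a Jaccard index of 0.5, and miscounting one of three patients' relatives by one gives an MSE of 1/3.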

6
Development and validation of a user-friendly smartphone imaging and telemedicine platform for remote diagnosis of anterior segment eye disease

Parikh, K. S.; Reddy, K.; Shuff, J.; Kong, X.; Li, X.; Hariharakumar, S.; Santhanaraj, V. A.; Kakoty, R.; Kamam, B.; Ravilla, P. K.; Vivekanand, N.; Hoopes, M.; Detels, K.; Mohseni, N.; Verma, R.; Shumeyko, D.; Yadalla, D.; Venkatesh, R.; Shekhawat, N. S.

2025-11-15 ophthalmology 10.1101/2025.11.11.25339801 medRxiv
Top 0.1%
15.0%

Background: Cataract and anterior segment diseases are leading causes of blindness in low-resource settings. Eye camp screenings remain the primary mode of community outreach but are constrained by cost, logistics, and dependence on highly trained specialists. We designed and validated a low-cost, user-friendly smartphone-based anterior segment imaging and teleophthalmology platform to enable community health workers (CHWs) to perform diagnostic-quality eye screening. Methods: We designed a portable imaging device (Scout™) paired with an accessible Android smartphone and mobile application (InSightful) for telemedicine in low-bandwidth settings. CHWs underwent 3 hours of training on using the imaging software and hardware, then screened patients across 19 rural eye camps in South India, with anterior segment images and clinical data uploaded to a cloud-based database for remote ophthalmologist (RO) review. Diagnoses and referral decisions made by ROs were compared with those of in-person eye camp ophthalmologists (ECOs). CHWs, ROs, and patients were surveyed on the platform's feasibility and acceptability. Findings: N=1,093 patients underwent eye camp screening by ECOs and CHW-led smartphone screening with RO review. CHWs completed screenings in <2.5 minutes/eye and obtained diagnostic-quality images for >90% of eyes. ROs and ECOs showed 96.1% concordance for referral decisions (95% CI 94.7-97.1) and substantial agreement in diagnosis of any cataract (κ 0.77, percent agreement [PA] 89%), mature cataract (κ 0.67, PA 96%), immature cataract (κ 0.69, PA 85%), clear crystalline lens (κ 0.65, PA 89%), and pseudophakia (κ 0.92, PA 97%), and moderate agreement for pterygium (κ 0.47, PA 94%). Concordance increased with image quality. CHWs, ROs, and patients reported high usability, acceptability, and net promoter scores.
Interpretation: Scout™ anterior segment screening by minimally trained CHWs achieves diagnostic and referral accuracy comparable to in-person ophthalmologist examinations, supporting its potential to decentralize cataract screening and expand access to eye care in low-resource settings. Funding: National Eye Institute R21EY034343, National Eye Institute K23EY032988, National Eye Institute P30EY01765 (Biostatistics Core), Microsoft Innovation Acceleration Award, Johns Hopkins Center for Global Health, Stephen F. Raab and Mariellen Brickley-Raab Rising Professorship in Ophthalmology, Boone Pickens Rising Professorship in Ophthalmology.
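Cohen's kappa, reported above alongside percent agreement, corrects observed agreement for the agreement expected by chance given each rater's marginal frequencies. A minimal sketch from a rater-vs-rater contingency table (counts illustrative, not the study's):

```python
def cohens_kappa(table: list[list[int]]) -> tuple[float, float]:
    """Cohen's kappa and percent agreement from a square contingency
    table (rows: rater 1's labels, columns: rater 2's labels)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    po = sum(table[i][i] for i in range(k)) / n          # observed agreement
    row = [sum(table[i][j] for j in range(k)) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    pe = sum(row[i] * col[i] for i in range(k)) / n**2   # chance agreement
    return (po - pe) / (1 - pe), po
```

With balanced marginals, 80% raw agreement yields κ = 0.6, which is why percent agreement can look high (e.g., pterygium PA 94%) while κ stays moderate (0.47) for a rare finding.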

7
Impact of Anti-VEGF Treatment for Diabetic Macular Oedema on Progression to Proliferative Diabetic Retinopathy: Data-driven Insights from a Multicentre Study

Olvera-Barrios, A.; Lilaonitkul, W.; Heeren, T. F.; Rozenberg, A.; Thomas, D.; Warwick, A. N.; Somroo, T.; Alsaedi, A. H.; Schwartz, R.; Chakravarthy, U.; Eleftheriadis, H.; Patwardhan, A.; Ghanchi, F.; Taylor, P.; Tufail, A.; Egan, C.; UK DR EMR Users Group,

2023-11-12 ophthalmology 10.1101/2023.11.10.23298261 medRxiv
Top 0.1%
14.8%

Background: To report insights on proliferative diabetic retinopathy (PDR) risk modification with repeated anti-vascular endothelial growth factor (anti-VEGF) injections for the treatment of diabetic macular oedema (DMO) in routine care, and to present data-driven PDR screening recommendations for injection clinics. Methods: Multicentre study (27 UK NHS centres) of patients with non-proliferative diabetic retinopathy (NPDR) with and without DMO. The primary outcome was PDR development. Repeated anti-VEGF injections were modelled as time-dependent covariates using Cox regression and weighted cumulative exposure (WCE), adjusting for baseline diabetic retinopathy (DR) grade, age, sex, ethnicity, type of diabetes, and deprivation. A propensity score matched cohort was used to estimate the treatment effect on PDR incidence rates (IR). Results: We included 5716 NPDR eyes (5716 patients, 2858 DMO eyes). The WCE method showed a better model fit. Anti-VEGF injections showed a protective effect on risk of PDR during the most recent 4 weeks from exposure, which rapidly decreased. There was a 20% reduction in risk of PDR (p=0.006) in treated eyes. Severe NPDR had a 4.6-fold increase in PDR hazards when compared with mild NPDR (p<0.001). The annual IR of untreated mild-NPDR cases was 2.3 (95% CI 1.57-3.23) per 100 person-years. In NPDR DMO cases treated with anti-VEGF, similar IRs would occur with annual review for mild, 6-monthly for moderate, and 3-monthly for severe NPDR. Conclusion: The WCE method is a better modelling strategy than traditional Cox models for repeated exposures in ophthalmology. Injections are protective against PDR predominantly within the most recent 4 weeks. Based on observed data, we suggest follow-up recommendations for PDR detection according to retinopathy grade at first injection. Précis: This study describes the impact on PDR risk of anti-VEGF injections for DMO in routine care and provides data-driven recommendations for reassessment of the peripheral retina for people in long-term injection clinics.
Key messages: What is already known on this topic: Clinical trials have shown that intravitreal anti-VEGF injections reduce the incidence rate of PDR. Repeated intravitreal anti-VEGF injections are the mainstay of treatment for DMO; however, there is little evidence on how these exposures affect the risk of PDR in clinical practice. What this study adds: The impact of anti-VEGF on PDR risk varies with the timing of exposure, and the effect is not permanent. Despite repeated anti-VEGF treatments, patients with DMO may still progress to PDR. How this study might affect research, practice, or policy: Our work underscores the significance of accounting for repeated treatments at varying time intervals in ophthalmology, highlighting the utility of the weighted cumulative exposure method. Implementing adequate modelling strategies to address the complexities of exposures in clinical settings can improve predictions and patient outcomes. We provide PDR screening recommendations for DMO patients undergoing anti-VEGF treatment in injection clinics; implementation would improve the safety and efficiency of treatment pathways.
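The weighted cumulative exposure (WCE) idea is that each past injection contributes to current risk through a weight that depends on time since exposure, which is how a protective effect can be concentrated in the most recent 4 weeks. The study estimates the weight function from the data; the linear decay below is purely illustrative, a sketch of the bookkeeping only:

```python
def wce(injection_times: list[int], t: int, window: int = 12) -> float:
    """Weighted cumulative exposure at time t (in weeks): each past
    injection at time s contributes weight w(t - s), here a linear
    decay over `window` weeks (illustrative; the study fits w from data)."""
    total = 0.0
    for s in injection_times:
        lag = t - s
        if 0 <= lag < window:
            total += 1.0 - lag / window
    return total
```

An eye injected at weeks 0, 4, and 8 has, at week 8, a WCE of 1/3 + 2/3 + 1 = 2.0 under this toy weight function; the fitted weights in the paper instead place most of the effect within the most recent 4 weeks.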

8
Diagnostic test accuracy of artificial intelligence in screening for referable diabetic retinopathy in real-world settings: A systematic review and meta-analysis

Uy, H.; Fielding, C.; Hohlfeld, A.; Ochodo, E.; Opare, A.; Mukonda, E.; Minnies, D.; Engel, M. E.

2023-06-22 ophthalmology 10.1101/2023.06.20.23291687 medRxiv
Top 0.1%
14.7%

Studies on artificial intelligence (AI) in screening for diabetic retinopathy (DR) have shown promising results in addressing the mismatch between the capacity to implement DR screening and the increasing DR incidence; however, most of these studies were done retrospectively. This review sought to evaluate the diagnostic test accuracy (DTA) of AI in screening for referable diabetic retinopathy (RDR) in real-world settings. We searched CENTRAL, PubMed, CINAHL, Scopus, and Web of Science on 9 February 2023. We included prospective DTA studies assessing AI against trained human graders (HGs) in screening for RDR in patients living with diabetes. Two reviewers independently extracted data and assessed methodological quality against QUADAS-2 criteria. We used the hierarchical summary receiver operating characteristic (HSROC) model to pool estimates of sensitivity and specificity, and forest plots and SROC plots to visually examine heterogeneity in accuracy estimates. Finally, we conducted sensitivity analyses to explore the effects of studies deemed likely to affect the quality of the analysis. We included 15 studies (17 datasets: 10 with patient-level analysis (N=45,785) and 7 with eye-level analysis (N=15,390)). Meta-analyses revealed a pooled sensitivity of 95.33% (95% CI: 90.60-100%) and specificity of 92.01% (95% CI: 87.61-96.42%) for the patient-level analysis; for the eye-level analysis, pooled sensitivity was 91.24% (95% CI: 79.15-100%) and specificity 93.90% (95% CI: 90.63-97.16%). Subgroup analyses showed no variation in diagnostic accuracy by country classification or DR classification criteria; however, diagnostic accuracy increased moderately in primary-level and decreased minimally in tertiary-level healthcare settings. Sensitivity analyses did not show any variation in studies that included diabetic macular edema in the RDR definition, nor in studies with ≥3 HGs.
This review provides evidence, for the first time from prospective studies, for the effectiveness of AI in screening for RDR in real-world settings.

9
Comparative Performance of retinIA, an AI-powered Ophthalmic Screening Tool, and First-Year Residents in Retinal Disease Detection and Glaucoma Assessment: A Study in a Mexican Tertiary Care Setting

Camacho-Garcia-Formenti, D.; Baylon-Vazquez, G.; Arriozola-Rodriguez, K. J.; Avalos-Ramirez, L. E.; Hartleben-Matkin, C.; Valdez Flores, H. F.; Hodelin-Fuentes, D.; Noriega Campero, A.

2024-08-28 ophthalmology 10.1101/2024.08.26.24311677 medRxiv
Top 0.1%
14.6%

Background: Artificial intelligence (AI) shows promise in ophthalmology, but its potential in tertiary care settings in Latin America remains understudied. We evaluated a Mexican AI-powered screening tool against first-year ophthalmology residents in a tertiary care setting in Mexico City. Methods: We analysed 435 adult patients undergoing their first ophthalmic evaluation. AI and resident assessments were compared against expert annotations for retinal disease, cup-to-disk ratio (CDR) measurements, and glaucoma suspect classification. We also evaluated a synergistic approach combining AI and resident assessments. Results: For glaucoma suspect classification, AI outperformed residents in accuracy (88.6% vs 82.9%, p = 0.016), sensitivity (63.0% vs 50.0%, p = 0.116), and specificity (94.5% vs 90.5%, p = 0.062). The synergistic approach yielded a higher sensitivity (80.4%) than residents alone or AI alone (p < 0.001). AI's CDR estimates showed lower mean absolute error (0.056 vs 0.105, p < 0.001) and higher correlation with expert measurements (r = 0.728 vs r = 0.538). In retinal disease assessment, AI demonstrated higher sensitivity (90.1% vs 63.0% for medium/high-risk, p < 0.001) and specificity (95.8% vs 90.4%, p < 0.001). Furthermore, differences between AI and residents were statistically significant across all metrics. The synergistic approach achieved the highest sensitivity for retinal disease (92.6% for medium/high-risk, 100% for high-risk). Conclusion: AI outperforms first-year residents in key ophthalmic assessments. The synergistic use of AI and resident assessments shows potential for optimizing diagnostic accuracy, highlighting the value of AI as a supportive tool in ophthalmic practice, especially for early-career clinicians.
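The mean absolute error and Pearson correlation used above to compare CDR estimates against expert measurements are standard; a minimal sketch with toy values (the study's data are not reproduced here):

```python
import math

def mae(pred: list[float], truth: list[float]) -> float:
    """Mean absolute error between estimated and expert CDR values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two measurement series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```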

10
Racial and Sociodemographic Disparities in Blindness Associated with Primary Angle Closure Glaucoma in the United States: An IRIS(R) Registry Analysis

Shah, S. N.; Zhou, S.; Sanvicente, C.; Burkemper, B.; Apolo, G.; Li, C.; Li, S.; Liu, L.; Lum, F.; Moghimi, S.; Xu, B.

2022-08-27 ophthalmology 10.1101/2022.08.26.22279190 medRxiv
Top 0.1%
14.2%

Purpose: To assess the prevalence and risk factors of blindness among patients newly diagnosed with primary angle closure glaucoma (PACG) in the United States (US). Design: Retrospective cross-sectional study of patients from the American Academy of Ophthalmology IRIS(R) (Intelligent Research in Sight) Registry. Participants: Patients in the IRIS(R) Registry between 2015 and 2019 with a new diagnosis of PACG and visual acuity (VA) data on, or within 90 days prior to, the date of diagnosis. Methods: Eligible patients were aged 18 years and older and: (1) were observable in the database at least 24 months prior to the index date of PACG diagnosis; (2) had no history of intraocular pressure (IOP)-lowering drops, laser peripheral iridotomy (LPI), cataract surgery, or a diagnosis of pseudophakia unless preceded by a diagnosis of anatomical narrow angle (ANA); and (3) had no history of glaucoma surgery. Multivariable logistic regression models were developed to assess risk factors of blindness. Main Outcome Measures: Any (one or both eyes) or bilateral (both eyes) blindness (VA ≤20/200) at first diagnosis of PACG. Results: 43,901 patients with PACG in the IRIS(R) Registry met inclusion criteria. Overall prevalence of any and bilateral blindness was 11.5% and 1.8%, respectively. Black and Hispanic patients were at higher risk of any (OR=1.42 and 1.21, respectively; p<0.001) and bilateral (OR=2.04 and 1.53, respectively; p<0.001) blindness compared to non-Hispanic White patients, adjusted for ocular comorbidities including cataracts. Other factors associated with any blindness included age <50 or >80 years, male sex, Medicaid or Medicare insurance category, and Southern or Western practice region (ORs>1.28; p≤0.01). Diagnosis of ANA prior to diagnosis of PACG was protective against any (OR=0.56; p<0.001) and bilateral (OR=0.61; p<0.001) blindness.
Conclusions: Blindness affects 1 out of 9 patients with newly diagnosed PACG in the IRIS(R) Registry; Black and Hispanic patients and Medicaid and Medicare recipients are significantly more vulnerable. These findings highlight the severe ocular morbidity associated with PACG and the need for increased disease awareness and improved detection methods.

11
Patient and Practice Level Visual Acuity Prior to Cataract Surgery: An IRIS(R) Registry (Intelligent Research in Sight) Analysis

Tainsh, L.; Douglas, V. P.; Gilbert, J. B.; Ross, C. J.; Manz, S.; Kearney, W.; Elze, T.; Miller, J. W.; Lorch, A. C.

2025-07-08 ophthalmology 10.1101/2025.07.07.25331037 medRxiv
Top 0.1%
14.2%

Purpose: To examine the influence of patient demographic characteristics and ophthalmic practice composition on access to cataract surgery in the United States, as measured by preoperative best-corrected visual acuity (BCVA). Patients and Methods: This retrospective cohort study analyzed data from the IRIS(R) Registry (Intelligent Research in Sight) for patients aged >50 who had at least one BCVA measurement in the six months preceding cataract surgery performed between January 1, 2016, and December 31, 2020. We used a mixed-effects model to estimate the relationship between individual-level demographic factors, practice-level composition factors, and preoperative BCVA. Results: 2,387,045 individuals met inclusion criteria. The mean BCVA prior to surgery was 0.23 (SD: 0.32) logMAR. The worst preoperative BCVA was observed in Hispanic patients and the best in White patients [0.34 (SD: 0.43) vs. 0.21 (SD: 0.30); p<0.001]. Grouping patients by percentage of BCVA worse than 20/50 prior to surgery, Hispanic patients, active smokers, and uninsured patients had higher percentages of worse preoperative vision (33.7%, 23.5%, and 34.9%, respectively). Analysis of compositional effects of race and ethnicity, smoking, and insurance status showed that, regardless of an individual patient's demographics, patients treated at practices serving higher proportions of White patients showed better BCVA (b = -0.008 per 10 percentage points, P < .001), while patients at practices with higher percentages of actively smoking patients showed worse BCVA (b = -0.016 per 10 percentage points of active smoking patients, P < .001). There was no compositional effect of insurance status. Conclusions and Relevance: Overall, differences exist in the visual acuity at which cataract surgery is initiated, at both the level of the individual patient and the composition of the practice in which they are treated.
Plain Language Summary: Demographic disparities and geographic variation in access to cataract surgery in the United States have been previously described in large national studies of insurance data. Smaller single-institution studies expanded upon these by showing differences in preoperative visual acuity (an important measure of access to cataract surgery) based on factors such as race and insurance status, but were limited by the size and scope of their study populations. The IRIS(R) Registry (Intelligent Research in Sight) is the nation's first comprehensive ophthalmic clinical registry, with data from both individual patients and ophthalmic group practices. Using data from this registry, we show differences in preoperative visual acuity prior to cataract surgery at both the level of the patient and the practice in which they are treated.

12
OphthUS-GPT: Multimodal AI for Automated Reporting in Ophthalmic B-Scan Ultrasound

Gan, F.; Chen, L.; Qin, W. G.; Han, Q. L.; Long, X.; Fan, H. M.; Li, X. Y.; Yu, H. Z.; Zhang, J. H.; Xu, N.; Cheng, J. X.; Cao, J.; Liu, K. C.; Shao, Y. N.; Li, X. N.; Wan, Q.; Liu, T.; You, Z. P.

2025-03-04 ophthalmology 10.1101/2025.03.03.25323237 medRxiv
Top 0.1%
14.0%

IMPORTANCE: The rapid advancement of AI in ophthalmology is transforming diagnostics, especially in resource-limited settings. The shortage of ophthalmologists and the lack of standardized reporting create an urgent need for AI systems capable of automated reporting and interactive decision support. OBJECTIVE: To develop OphthUS-GPT, a multimodal AI system integrating BLIP and DeepSeek models for automated report generation and clinical decision support from ophthalmic B-scan ultrasound images. DESIGN, SETTING, AND PARTICIPANTS: This retrospective study at the Affiliated Eye Hospital of Jiangxi Medical College collected B-scan ultrasound reports between 2017 and 2024, including 54,696 images and 9,392 reports from 31,943 patients (mean age 49.14±0.124 years; 50.15% male). MAIN OUTCOMES AND MEASURES: Evaluation included two components: diagnostic report generation and question-answering system assessment. Report generation was evaluated using text metrics (ROUGE-L, CIDEr), disease classification metrics (accuracy, sensitivity, specificity, precision, F1 score), and ophthalmologist ratings for accuracy and completeness. The question-answering system was assessed by ophthalmologists rating answers on accuracy, completeness, potential harm, and satisfaction. RESULTS: OphthUS-GPT achieved ROUGE-L and CIDEr scores of 0.6131 and 0.9818 in report generation. For common conditions, accuracy exceeded 90% with precision >70%. Expert assessment showed >90% of reports scored ≥3/5 for correctness and 96% for completeness. The DeepSeek-R1-Distill-Llama-8B (DeepSeek) question-answering component performed comparably to GPT-4o and OpenAI o1, outperforming other models. CONCLUSIONS AND RELEVANCE: OphthUS-GPT demonstrated excellent performance in automatic report generation and intelligent Q&A, offering a novel solution for ophthalmic ultrasound interpretation and clinical decision support.

13
Safety of Pharmacologic Dilation: Acute Angle Closure Incidence in a Los Angeles County-Wide Safety Net Teleretinal Screening Program

Lang, T.; Xu, B. Y.; Li, Z.; Iyengar, S.; Kesselman, C.; Ambite, J.-L.; Bolo, K.; Do, J.; Wong, B.; Daskivich, L.

2025-06-28 ophthalmology 10.1101/2025.06.26.25330091 medRxiv
Top 0.1%
12.9%

Importance: Pharmacologic dilation is vital for eye disease screening but is often avoided due to concerns about triggering acute angle closure (AAC), a sight-threatening ophthalmic emergency. Objective: To assess AAC incidence after dilation and validate the use of International Classification of Diseases (ICD) codes for identifying AAC cases. Design: Retrospective cohort study. Setting: Primary care-based teleretinal diabetic retinopathy screening (TDRS) program. Participants: Eligible participants were Los Angeles County (LAC) Department of Health Services (DHS) patients who underwent teleretinal screening by dilated fundus photography between August 23, 2013, and March 1, 2024. Potential AAC cases were identified using ICD codes for angle closure, including acute angle closure glaucoma (AACG), primary angle closure glaucoma (PACG), and anatomical narrow angle (ANA), within three months of dilation. All urgent care, emergency department, and eye clinic encounters within the next calendar day after TDRS, and encounters with Current Procedural Terminology (CPT) codes for iridectomy/iridotomy or lens extraction within 14 calendar days of TDRS, were also identified. Manual chart review was conducted to verify AAC cases and extract clinical information. Exposures: Dilation with 1.0% or 0.5% tropicamide. Main Outcomes and Measures: Cumulative incidence of AAC after dilation. Results: 84,008 patients received 168,796 dilations, with a mean (± standard deviation) of 2.01 ± 1.50 dilations per patient. 55.1% were female; mean age was 55.4 ± 10.7 years. The cohort was 67.7% Hispanic, 8.2% Black, 6.3% Asian, 4.1% White, and 2.4% Other. Manual chart review confirmed four AAC cases after dilation: 3 coded as AACG and 1 as ANA. The AAC risk was 2.4 (95% CI 0.05-4.69) per 100,000 dilations (0.0024%) or 4.8 (95% CI 0.10-9.43) per 100,000 patients (0.0048%). All four cases were female, had narrow angles in the non-presenting eye on gonioscopy, and presented within one day of dilation with AAC symptoms, including eye pain and blurry vision. Conclusions and Relevance: AAC risk was less than 1 in 40,000 dilations in a high-volume TDRS program serving a diverse safety-net population, supporting the overall safety of dilation in this setting. Further discussion of AAC risk as a contraindication to dilation is warranted.
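The quoted incidence intervals are consistent with a normal-approximation (Poisson) confidence interval on the event count. A short sketch that reproduces the reported figures under that assumption follows; the authors' exact method is not stated in the abstract.

```python
import math

def rate_per_100k(events, n, z=1.96):
    """Event rate per 100,000 with a normal-approximation (Poisson) CI.

    Uses se = sqrt(events)/n, which reproduces the intervals reported
    in the abstract; the authors' exact method is an assumption here."""
    rate = events / n * 1e5
    half = z * math.sqrt(events) / n * 1e5
    return rate, max(rate - half, 0.0), rate + half

per_dilation = rate_per_100k(4, 168_796)  # ≈ (2.37, 0.05, 4.69) per 100,000 dilations
per_patient = rate_per_100k(4, 84_008)    # ≈ (4.76, 0.10, 9.43) per 100,000 patients
```

Both calls match the abstract's 2.4 (0.05-4.69) per 100,000 dilations and 4.8 (0.10-9.43) per 100,000 patients after rounding.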

14
The Inherited Retinal Disease Pathway in the United Kingdom: a Patient Perspective and the Potential of AI

Wong, W.; Sumodhee, D.; Morris, T. A.; Tailor, B.; Hollyhead, C.; Woof, W. A.; Archer, S.; Veal, C.; Lobo, L.; Daich Varela, M.; Cabral De Guimaraes, T. A.; Gomes, M.; Shah, M.; Downes, S. M.; Madhusudhan, S.; Mahroo, O. A.; Webster, A. R.; Michaelides, M.; Pontikos, N.

2025-01-14 ophthalmology 10.1101/2025.01.14.25320497 medRxiv
Top 0.1%
12.6%

Background: Inherited Retinal Diseases (IRDs) are the leading cause of blindness in young people in the UK. Despite significant advances in genomic medicine, diagnosis of these conditions remains challenging: many patients endure lengthy diagnostic odysseys, and even after genetic testing around 40% do not receive a definite genetic diagnosis. This survey aims to explore the experience of individuals affected by IRDs and their relatives, friends, and caregivers, and the potential acceptability of an AI technology such as Eye2Gene. Methods: This cross-sectional survey was distributed electronically using the Qualtrics-encrypted platform between April and August 2024. The mixed-methods survey included Likert-scale and open-ended questions. Analysis was performed using descriptive statistics and content-analysis methods. Results: The survey was answered by 247 respondents, of whom 79.8% were patients; the remainder were relatives, friends, and caregivers. There was substantial variability in patient diagnostic journeys in terms of waiting times to see a specialist (IQR 1 to 4 years), commute required (IQR 10 to 74 miles), and number of visits to reach a diagnosis (IQR 2 to 4). A substantial proportion of patients (35.8%) had a change in diagnosis. The majority of respondents (>90%) were overwhelmingly in favour of integrating AI into the IRD pathway to accelerate genetic diagnosis. Conclusion: This survey identifies several key gaps and disparities in the IRD pathway which could be addressed in part by the integration of AI for more equitable care. The survey also revealed a favourable attitude towards incorporating AI into the diagnostic testing of IRDs. Synopsis: A survey of 247 people directly or indirectly affected by inherited retinal diseases in the UK reports substantial gaps and disparities in the patient diagnostic pathway which could in part be addressed by artificial intelligence.

15
Comparison of Foundation and Supervised Learning-Based Models for Detection of Referable Glaucoma from Fundus Photographs

Bolo, K.; Nguyen, T. H.; Iyengar, S.; Li, Z.; Nguyen, V.; Wong, B.; Do, J.; Ambite, J.-L.; Kesselman, C.; Daskivich, L.; Xu, B.

2025-08-24 ophthalmology 10.1101/2025.08.21.25334170 medRxiv
Top 0.1%
12.5%

Purpose: To compare the performance of a foundation model and a supervised learning-based model for detecting referable glaucoma from fundus photographs. Design: Evaluation of diagnostic technology. Participants: 6,116 participants from the Los Angeles County Department of Health Services Teleretinal Screening Program. Methods: Fundus photographs were labeled for referable glaucoma (cup-to-disc ratio ≥0.6) by certified optometrists. Four deep learning models were trained on cropped and uncropped images (training N = 8,996; validation N = 3,002) using two architectures: a vision transformer with self-supervised pretraining on fundus photographs (RETFound) and a convolutional neural network (VGG-19). Models were evaluated on a held-out test set (N = 1,000) labeled by glaucoma specialists and an external test set (N = 300) from University of Southern California clinics. Performance was assessed while varying training set size and stratifying by demographic factors. XRAI was used for saliency mapping. Main Outcome Measures: Area under the receiver operating characteristic curve (AUC-ROC) and threshold-specific metrics. Results: The cropped-image VGG-19 model achieved the highest AUC-ROC (0.924 [0.907-0.940]), which was comparable (p = 0.07) to the cropped-image RETFound model (0.911 [0.892-0.930]); the latter achieved the highest Youden-optimal performance (sensitivity 82.6%, specificity 88.2%) and F1 score (0.801). Cropped-image models outperformed their uncropped counterparts within each architecture (p < 0.001 for AUC-ROC comparisons). RETFound models had a performance advantage when trained on smaller datasets (N < 2,000 images), and the uncropped-image RETFound model performed best on external data (p < 0.001 for AUC-ROC comparisons). The cropped-image RETFound model performed consistently across ethnic groups (p = 0.20), while the other models did not (p < 0.04); performance did not vary by age or gender. Saliency maps for both architectures consistently included the optic nerve. Conclusion: While both RETFound and VGG-19 models performed well for classification of referable glaucoma, foundation models may be preferable when training data are limited and domain shift is expected. Training models on images cropped to the region of the optic nerve improves performance regardless of architecture but may reduce model generalizability.
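The AUC-ROC and Youden-optimal operating point reported above can both be computed directly from raw model scores. The following is an illustrative self-contained sketch, not the study's evaluation code.

```python
def auc_and_youden(scores, labels):
    """AUC (Mann-Whitney form) and the Youden-optimal threshold,
    i.e. the score cutoff maximizing sensitivity + specificity - 1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # AUC = P(score_pos > score_neg) + 0.5 * P(tie) over all pos/neg pairs
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens = sum(p >= t for p in pos) / len(pos)
        spec = sum(n < t for n in neg) / len(neg)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return auc, best_t
```

A perfectly separating classifier scores AUC 1.0; a classifier no better than chance scores 0.5.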

16
Machine Learning-Based Prediction of Postoperative Refraction in Cataract Surgery: A Stacking Ensemble Approach

Ipek-Ugay, S.; Zeyadi, G.

2026-01-29 ophthalmology 10.64898/2026.01.24.26344648 medRxiv
Top 0.1%
12.4%

Background: Achieving precise postoperative refractive outcomes remains a significant challenge in cataract surgery. While advanced intraocular lens (IOL) power calculation formulas exist, they are constrained by their singular algorithmic structures. This study investigated whether a stacking ensemble machine learning approach could overcome these limitations. Methods: A dataset of 1,710 eyes from patients who underwent cataract surgery with monofocal IOL implantation (Vivinex or SA60AT) was utilized. Following rigorous preprocessing and feature engineering, a stacking ensemble architecture was developed comprising three diverse base learners (Multi-Layer Perceptron, Support Vector Regressor with RBF kernel, and SplineTransformer with Linear Regression) and a Ridge Regressor meta-learner. The model was trained on 80% of the data using 5-fold cross-validation and evaluated on an independent 20% test set (n=341). Performance was compared against six standard IOL formulas. Results: The stacking ensemble model demonstrated excellent predictive accuracy, achieving a Mean Absolute Error (MAE) of 0.272 D on the independent test set (n=341). The model achieved lower MAE than all six standard IOL formulas, including Kane (MAE 0.295 D) and Barrett Universal II (MAE 0.318 D). Clinically, 85.1% of eyes achieved predictions within ±0.50 D, compared to 82.5% for the Kane formula and 81.8% for Barrett Universal II. Conclusion: The stacking ensemble machine learning model significantly enhances postoperative refraction prediction accuracy compared to established IOL calculation formulas. By leveraging algorithmic diversity and data-driven learning, this approach represents a promising advancement toward reducing refractive surprises and improving patient satisfaction in cataract surgery. External validation on independent datasets is required to confirm generalizability.

17
Outcomes of the Advanced Visualization In Corneal Surgery Evaluation (ADVISE) trial: a non-inferiority randomized controlled trial to evaluate the use of intraoperative OCT during Descemet membrane endothelial keratoplasty

Muijzer, M. B.; Delbeke, H.; Dickman, M. M.; Nuijts, R. M. M. A.; Noordmans, H.-J.; Imhof, S. M.; Wisse, R. P. L.

2022-01-21 ophthalmology 10.1101/2022.01.18.22269460 medRxiv
Top 0.1%
12.4%

Purpose: To evaluate whether an intraoperative OCT (iOCT)-optimized surgical protocol without prolonged overpressure is non-inferior to a standard protocol during Descemet membrane endothelial keratoplasty (DMEK). Design: A multicenter international prospective non-inferiority randomized controlled trial. Subjects: Sixty-five pseudophakic eyes of 65 patients with corneal endothelial dysfunction resulting from Fuchs endothelial corneal dystrophy were enrolled at 3 corneal centers in the Netherlands and Belgium. Methods: The study was powered to include 63 patients scheduled for routine DMEK. Subjects were randomized to the control arm (n=33), without iOCT use and with the intraocular pressure raised above normal physiological limits for 8 minutes (i.e., overpressure), or the intervention arm (n=32), with OCT guidance to assess graft orientation and adherence while refraining from prolonged raising of the intraocular pressure. The risk difference (RD) and 95% confidence intervals (95% CI) were calculated from a logistic regression model using 1,000 bootstrap samples. Secondary outcomes included the incidence of graft detachment, surgeon-reported iOCT-aided surgical decision-making, surgical time, endothelial cell density (ECD), and corrected distance visual acuity (CDVA). Main Outcome Measures: The primary outcome was the incidence of postoperative surgery-related adverse events, defined as rebubbling, graft failure, and iatrogenic acute glaucoma. The non-inferiority margin was set at an RD of 10%. Results: In the control group, 13 adverse events were recorded in 10 subjects, compared to 13 adverse events in 12 subjects in the intervention group. The mean unadjusted RD was 0.38% (95% CI: -9.64 to 10.64) and the RD adjusted for study site was -0.32% (95% CI: -10.29 to 9.84). No significant differences in ECD and CDVA were found between the two groups at 3 and 6 months postoperatively. Surgeons reported that iOCT aided surgical decision-making in 40% of cases. Surgical and graft-unfolding times were, respectively, 13% and 27% shorter in the iOCT group. Conclusions: iOCT-guided DMEK surgery refraining from prolonged overpressure was non-inferior to conventional treatment. Surgery times were reduced considerably, and surgeons reported that iOCT aided surgical decision-making in 40% of cases. Refraining from prolonged overpressure did not affect postoperative ECD or CDVA.
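The bootstrap risk-difference analysis can be sketched as follows. This is a simplified version that resamples each arm's raw binary outcomes rather than the covariate-adjusted logistic model used in the trial, so it will not reproduce the adjusted estimates quoted above; the subject counts passed in below are illustrative inputs taken from the abstract.

```python
import random

def bootstrap_rd(events_a, n_a, events_b, n_b, reps=1000, seed=0):
    """Percentile bootstrap CI for the risk difference (arm B - arm A).

    Simplified sketch: resamples each arm's binary outcomes directly,
    not the site-adjusted logistic model used in the trial."""
    rng = random.Random(seed)
    arm_a = [1] * events_a + [0] * (n_a - events_a)
    arm_b = [1] * events_b + [0] * (n_b - events_b)
    rds = []
    for _ in range(reps):
        ra = sum(rng.choice(arm_a) for _ in range(n_a)) / n_a
        rb = sum(rng.choice(arm_b) for _ in range(n_b)) / n_b
        rds.append(rb - ra)
    rds.sort()
    point = events_b / n_b - events_a / n_a
    return point, rds[int(0.025 * reps)], rds[int(0.975 * reps)]

# Subjects with adverse events: 10/33 control, 12/32 intervention.
rd, lo, hi = bootstrap_rd(10, 33, 12, 32)
# Non-inferiority holds when the upper CI bound stays below the 10% margin.
```

Percentile bootstrap CIs like this are a common model-free baseline; the trial's logistic-regression bootstrap additionally adjusts for study site.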

18
Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT Assisting in Diagnosis of Corneal Eye Diseases, Glaucoma, and Neuro-Ophthalmology Diseases Based on Clinical Case Reports

Hussain, Z. S.; Delsoz, M.; Elahi, M.; Jerkins, B.; Kanner, E.; Wright, C.; Yousefi, S.

2025-03-17 ophthalmology 10.1101/2025.03.14.25323836 medRxiv
Top 0.1%
12.3%

Background: This study evaluates the diagnostic performance of several AI models, including DeepSeek, in diagnosing corneal diseases, glaucoma, and neuro-ophthalmologic disorders. Methods: We retrospectively selected 53 case reports from the Department of Ophthalmology and Visual Sciences at the University of Iowa, comprising 20 corneal disease cases, 11 glaucoma cases, and 22 neuro-ophthalmology cases. The case descriptions were input into DeepSeek, ChatGPT-4.0, ChatGPT-01, and Qwen 2.5 Max. These responses were compared with diagnoses rendered by human experts (corneal specialists, glaucoma attendings, and neuro-ophthalmologists). Diagnostic accuracy and interobserver agreement, defined as the percentage difference between each AI model's performance and the average human expert performance, were determined. Results: DeepSeek achieved an overall diagnostic accuracy of 79.2%, with specialty-specific accuracies of 90.0% in corneal diseases, 54.5% in glaucoma, and 81.8% in neuro-ophthalmology. ChatGPT-01 outperformed the other models with an overall accuracy of 84.9% (85.0% in corneal diseases, 63.6% in glaucoma, and 95.5% in neuro-ophthalmology), while Qwen exhibited a lower overall accuracy of 64.2% (55.0% in corneal diseases, 54.5% in glaucoma, and 77.3% in neuro-ophthalmology). Interobserver agreement analysis revealed that in corneal diseases, DeepSeek differed by -3.3% (90.0% vs 93.3%), ChatGPT-01 by -8.3%, and Qwen by -38.3%. In glaucoma, DeepSeek outperformed the human expert average by +3.0% (54.5% vs 51.5%), ChatGPT-4.0 and ChatGPT-01 exceeded it by +12.1%, and Qwen was +3.0% above the human average. In neuro-ophthalmology, DeepSeek and ChatGPT-4.0 were 9.1% lower than the human average, ChatGPT-01 exceeded it by +4.6%, and Qwen was 13.6% lower. Conclusions: ChatGPT-01 demonstrated the highest overall diagnostic accuracy, especially in neuro-ophthalmology, while DeepSeek and ChatGPT-4.0 showed comparable performance. Qwen underperformed relative to the other models, especially in corneal diseases. Although these AI models exhibit promising diagnostic capabilities, they currently lag behind human experts in certain areas, underscoring the need for collaborative integration with clinical judgment. Plain Language Summary: This study evaluated how well several artificial intelligence (AI) models diagnose eye diseases compared to human experts. We tested four AI systems across three types of eye conditions: diseases of the cornea, glaucoma, and neuro-ophthalmologic disorders. Overall, one AI model, ChatGPT-01, performed the best, correctly diagnosing about 85% of cases, and it excelled in neuro-ophthalmology by correctly diagnosing 95.5% of cases. Two other models, DeepSeek and ChatGPT-4.0, each achieved an overall accuracy of around 79%, while the Qwen model performed lower, with an overall accuracy of about 64%. When compared with human experts, who achieved very high accuracy in corneal diseases (93.3%) and neuro-ophthalmology (90.9%) but lower accuracy in glaucoma (51.5%), the AI models showed mixed results. In glaucoma, for instance, some AI models even slightly outperformed human experts, while in corneal diseases all AI models were less accurate than the experts. These findings indicate that while AI shows promise as a supportive tool in diagnosing eye conditions, it still needs further improvement. Combining AI with human clinical judgment appears to be the best approach for accurate eye disease diagnosis. Key summary points:
- Why carry out this study? With the rising burden of eye diseases and the inherent diagnostic challenges of complex conditions like glaucoma and neuro-ophthalmologic disorders, there is an unmet need for innovative diagnostic tools to support clinical decision-making.
- What did the study ask? This study evaluated the diagnostic performance of four AI models across three ophthalmologic subspecialties, testing the hypothesis that advanced language models can achieve accuracy levels comparable to human experts.
- What was learned from the study? Our results showed that ChatGPT-01 achieved the highest overall accuracy (84.9%), excelling in neuro-ophthalmology with 95.5% accuracy, while DeepSeek and ChatGPT-4.0 each achieved 79.2%, and Qwen reached 64.2%.
- What specific outcomes were observed? In glaucoma, AI model accuracies ranged from 54.5% to 63.6%, with some models slightly surpassing the human expert average of 51.5%, underscoring the diagnostic difficulty of this condition.
- What has been learned and future implications? These findings highlight the potential of AI as a valuable adjunct to clinical judgment in ophthalmology, although further research and the integration of multimodal data are essential to optimize these tools for routine clinical practice.
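The reported per-specialty and overall accuracies are consistent with simple correct-case counts over the 20/11/22 case split. The counts below are back-calculated here for illustration (using DeepSeek's reported percentages); the abstract itself does not state per-case tallies.

```python
def pct(correct, total):
    # accuracy as a percentage, rounded to one decimal as in the abstract
    return round(100 * correct / total, 1)

# Hypothetical correct-case counts implied by DeepSeek's reported accuracies,
# over the stated case mix: 20 corneal, 11 glaucoma, 22 neuro-ophthalmology.
deepseek = {"corneal": (18, 20), "glaucoma": (6, 11), "neuro": (18, 22)}

per_specialty = {k: pct(c, n) for k, (c, n) in deepseek.items()}
overall = pct(sum(c for c, _ in deepseek.values()),
              sum(n for _, n in deepseek.values()))
# per_specialty -> {'corneal': 90.0, 'glaucoma': 54.5, 'neuro': 81.8}
# overall -> 79.2
```

The same arithmetic underlies the interobserver-agreement figures, which are just differences between these percentages and the human-expert averages.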

19
Open-source DeepSeek-R1 Outperforms Proprietary Non-Reasoning Large Language Models With and Without Retrieval-Augmented Generation

Song, S.; Peng, K. C.; Wang, E. T.; Liu, T. Y. A.

2025-09-14 ophthalmology 10.1101/2025.09.12.25334809 medRxiv
Top 0.1%
12.2%

Objective: To compare reasoning large language models (LLMs) vs. non-reasoning LLMs and open-source DeepSeek models vs. proprietary LLMs in answering ophthalmology board-style questions, and to quantify the impact of retrieval-augmented generation (RAG). Design: Cross-sectional evaluation of LLM performance before and after RAG integration. Subjects: Seven LLMs: Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4 Turbo, GPT-4o, DeepSeek-V3, OpenAI-o1, and DeepSeek-R1. Methods: A RAG-integrated LLM workflow was developed using the American Academy of Ophthalmology's Basic and Clinical Science Course (Section 12: Retina and Vitreous) as an external knowledge source. The text was embedded into a Faiss vector database for retrieval. A curated set of 250 retina-related multiple-choice questions from OphthoQuestions was used for evaluation. Each model was tested under both pre-RAG (question-only) and post-RAG (question + retrieved context) conditions across 4 independent runs on the question set. Accuracy was calculated as the proportion of correct answers. Statistical analysis included paired t-tests, two-way ANOVA, and Tukey's HSD test. Main Outcome Measures: Accuracy (percentage of correct answers). Results: RAG integration significantly improved accuracy across all models (p < 0.01). Two-way ANOVA confirmed significant effects of LLM choice (p < 0.001) and RAG status (p < 0.001) on model accuracy. Accuracy ranged from 56.8% (Gemini 1.5 Pro) to 87.5% (OpenAI-o1) pre-RAG, improving post-RAG to 76.3% and 89.8%, respectively. Reasoning models (OpenAI-o1, DeepSeek-R1) significantly outperformed non-reasoning models. Open-source models achieved near parity with proprietary counterparts: DeepSeek-V3 with RAG (80.7%) performed comparably to GPT-4o with RAG (80.9%). DeepSeek-R1 with RAG slightly underperformed OpenAI-o1 with RAG (86.0% vs 89.8%) but otherwise outperformed all other evaluated models (p < 0.001). Conclusion: Our findings demonstrate that reasoning models significantly outperformed non-reasoning models and that RAG significantly enhanced accuracy across all models. Open-source models, trained at significantly lower cost, achieved near parity with proprietary systems. The performance of DeepSeek-V3 and DeepSeek-R1 highlights the viability of cost-efficient, customizable, locally deployable LLMs for clinical applications. Future research should explore model fine-tuning, prompt engineering, and alternative retrieval methods to further improve LLM accuracy and reliability in medicine.
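The retrieve-then-augment workflow described above (embed BCSC chunks, find the chunks nearest to the question, prepend them as context) can be sketched as follows. For portability this uses plain NumPy cosine similarity in place of the Faiss index the paper names, and the two-dimensional vectors and chunk texts are toy stand-ins for real text embeddings.

```python
import numpy as np

def build_index(chunk_vecs):
    # normalize rows so dot product = cosine similarity
    # (a stand-in for a Faiss flat index over chunk embeddings)
    v = np.asarray(chunk_vecs, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def retrieve(index, query_vec, k=3):
    # return indices and similarities of the k nearest chunks
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    sims = index @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def augment_prompt(question, chunks, top_idx):
    # post-RAG condition: question plus retrieved context
    context = "\n".join(chunks[i] for i in top_idx)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy corpus and hand-made "embeddings" for illustration only.
chunks = ["retinal detachment", "optic neuritis", "posterior vitreous detachment"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
index = build_index(toy_vecs)
```

In the real system the embeddings would come from a text-embedding model and the search from a Faiss index, but the pre-RAG vs. post-RAG comparison reduces to whether `augment_prompt` output or the bare question is sent to the model.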

20
Strabismus/Diplopia After Glaucoma Drainage Devices: A Systematic Review of Method-Dependent Rates, Time Courses, Risk Factors, and Managements

Alfatih, M.; Abu Serhan, H.; Elhusseiny, A.

2025-10-07 ophthalmology 10.1101/2025.10.05.25337358 medRxiv
Top 0.1%
12.2%

Withdrawal Statement: The authors have withdrawn this manuscript because the uploaded manuscript contains some errors, which the authors are currently fixing. The authors do not want to upload an incomplete manuscript and therefore do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.